Econometrics and Statistics (520K)
Linear Regression with One Regressor (Part 3)
1) Create your (working) directory
Create a new folder on your computer in which you can save all necessary files, i.e., the data sets, the R script with the implementation of your analysis, and the R helper function script with some helper functions provided by us.
The path to this folder will be your so-called working directory.
2) Download the data set
Download the data sets CASchools_01_data.txt and CPS_01_data.txt from ILIAS directly or by using the link below:
Download
CASchools_01_data.txt
Save the data sets into the (working) directory of your choice.
3) Prepare script
Open a new R script and save it into your (working) directory.
Run the following code chunk to prepare your R script, i.e., to clear the workspace and set your working directory.
# Clear workspace
rm(list = ls(all = TRUE))
# Set working directory
setwd(".../set/your/working/directory/...")
4) Install/load packages and download/save/include helper function script
Step 1) Install all relevant R packages and download and save the R script r_helper_functions.R from ILIAS by running the following code chunk.
# Download R-helper functions
source("https://ilias.uni-hohenheim.de/data/UHOH/lm_data/lm_1856939/MA_EconometricMethods_WiSe2324_PracticalClass/r-scripts/prepare_r_packages_and_helper_functions.R")
Note, the required R packages are "moments", "sandwich", "lmtest", "car", "plm", "ivreg", "dynlm", "forecast", "urca".
Step 2) Include the R script r_helper_functions.R into your R script by running the following code chunk.
# Include R-helper functions
source("r_helper_functions.R")
Note, more information on the helper functions can be found here.
Student teacher ratio and student performance
We continue our analysis of the relationship between student performance and class size.
For this we use again the data set CASchools_01_data.txt and the two variables \(TestScore\) and \(STR\) as obtained in the first exercise sheet.
Variables
| Variable | Description |
|---|---|
| \(students\) | Total enrollment. |
| \(teachers\) | Number of teachers. |
| \(read\) | Average reading score. |
| \(math\) | Average math score. |
Load data, construct variables and estimation results.
Load data set
# use absolute or relative path to load data
cas.dat <- read.table(file = "CASchools_01_data.txt",
header = TRUE,
sep = ",",
colClasses = c("numeric","numeric","numeric","numeric"))
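After loading the data, it is good practice to verify that the file was read in as expected; a quick check of the object just created:

```r
# quick sanity check of the loaded data
str(cas.dat)      # number of observations and column types
head(cas.dat)     # first six rows
summary(cas.dat)  # ranges help spot mis-read columns
```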
Construct variables
# compute STR and append it to cas.dat
cas.dat$STR <- cas.dat$students/cas.dat$teachers
# compute TestScore and append it to cas.dat
cas.dat$score <- (cas.dat$read + cas.dat$math)/2
Estimation results No 01
# estimate linear model
lm.res.01 <- lm(score ~ STR, data = cas.dat)
# show results
lm.res.01
The estimation results below include robust standard errors, which are not part of the results from the lm() function above.
\(\Rightarrow\) See: implementation below!
##
## =============================
## Model 1
## -----------------------------
## intercept 698.93 (10.34) ***
## STR -2.28 (0.52) ***
## -----------------------------
## R^2 0.05
## SER 18.58
## =============================
## *** p < 0.001; ** p < 0.01; * p < 0.05
Replicate the results in Table 1 of the exercise sheet.
Estimate linear regression model
# Estimate linear regression model
lm.res.01 <- lm(score ~ STR, data = cas.dat)
summary(lm.res.01)
##
## Call:
## lm(formula = score ~ STR, data = cas.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.727 -14.251 0.483 12.822 48.540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 698.9329 9.4675 73.825 < 2e-16 ***
## STR -2.2798 0.4798 -4.751 2.78e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
## F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
Scatter Plot with Fitted Regression Line
# plot parameters
par(mfrow=c(1, 1),
mar=c(5, 5, 3, 2))
# scatter plot
plot(cas.dat$STR, cas.dat$score,
main = "Scatter Plot of Test Score and STR",
xlab = "STR (X)",
ylab = "Test Score (Y)")
abline(lm.res.01, col = "red", lty = 2)
Standard errors robust w.r.t. heteroskedasticity: use the wrapper function lm_ct_fun() from above
# call wrapper function for robust inference
lin.mod.rob <- lm_ct_fun(score ~ STR, data = cas.dat, hc.type = "HC1")
# print results
lin.mod.rob$ct
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 698.93295 10.36436 67.4362 < 2.2e-16 ***
## STR -2.27981 0.51949 -4.3886 1.447e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
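lm_ct_fun() is a course-provided helper. Assuming it wraps coeftest() with a heteroskedasticity-consistent covariance estimate, its robust output can presumably be reproduced directly with the "sandwich" and "lmtest" packages from the required package list above:

```r
library(sandwich)  # heteroskedasticity-consistent covariance estimators
library(lmtest)    # coeftest() for t tests with a user-supplied vcov

# assumes lm.res.01 from the estimation chunk above
coeftest(lm.res.01, vcov. = vcovHC(lm.res.01, type = "HC1"))
```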
Sampling distribution of \(\widehat{\beta}_1\)
\[\begin{align} \widehat{\beta}_1 \sim N\left(\beta_1, \sigma^2_{\widehat{\beta}_1} \right) \;\;\; \text{(approximately)} \end{align}\]
with (heteroskedasticity robust)
\[\begin{align} \sigma^2_{\widehat{\beta}_1}=\frac{1}{n}\frac{\mathrm{var}\left[\left(X_i-\mu_X\right)u_i\right]}{\left[\mathrm{var}\left(X_i\right)\right]^2} \end{align}\]
under homoskedasticity
\[\begin{align} \tilde{\sigma}^2_{\widehat{\beta}_1}=\frac{1}{n}\frac{\mathrm{var}\left[u_i\right]}{\left[\mathrm{var}\left(X_i\right)\right]^2} \end{align}\]
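As an illustration only (in practice the sandwich package should be used), the plug-in sample analog of the robust variance formula above can be sketched as follows. This corresponds to the HC0 variant, so the resulting standard error will differ slightly from the HC1 numbers reported above:

```r
# assumes cas.dat and lm.res.01 from the chunks above
x <- cas.dat$STR
u.hat <- residuals(lm.res.01)
n <- length(x)
# sample analog of (1/n) * var[(X_i - mu_X) u_i] / (var(X_i))^2
var.b1.hc0 <- (1/n) * mean((x - mean(x))^2 * u.hat^2) /
  (mean((x - mean(x))^2))^2
sqrt(var.b1.hc0)  # robust SE of the slope (HC0 variant)
```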
Important
Add your interpretation here!
Test whether \(STR\) has a significant effect on \(score\).
Conduct inference
Based on homoskedasticity only
# call wrapper function lm_ct_fun (here, using homoskedastic ses)
lm.ct.ord.01 <- lm_ct_fun(score ~ STR, data = cas.dat, hc.type = "const")
ct.ord.01 <- lm.ct.ord.01$ct
ct.ord.01
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 698.93295 9.46749 73.8245 < 2.2e-16 ***
## STR -2.27981 0.47983 -4.7513 2.783e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Robust w.r.t. heteroskedasticity
# call wrapper function lm_ct_fun (here, using robust ses)
lm.ct.rob.01 <- lm_ct_fun(score ~ STR, data = cas.dat, hc.type = "HC1")
ct.rob.01 <- lm.ct.rob.01$ct
ct.rob.01
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 698.93295 10.36436 67.4362 < 2.2e-16 ***
## STR -2.27981 0.51949 -4.3886 1.447e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Confidence interval based on homoskedasticity only standard errors
# 95% ci bounds based on homoskedasticity
ci.95.up.ord.01 <- ct.ord.01[,1] + 1.96 * ct.ord.01[,2] # upper bound
ci.95.lo.ord.01 <- ct.ord.01[,1] - 1.96 * ct.ord.01[,2] # lower bound
# results
cbind(ci.95.lo.ord.01, ci.95.up.ord.01)
## ci.95.lo.ord.01 ci.95.up.ord.01
## (Intercept) 680.376667 717.48923
## STR -3.220266 -1.33935
Confidence interval based on heteroskedasticity robust standard errors
# 95% ci bounds based on heteroskedasticity-robust ses
ci.95.up.rob.01 <- ct.rob.01[,1] + 1.96 * ct.rob.01[,2] # upper bound
ci.95.lo.rob.01 <- ct.rob.01[,1] - 1.96 * ct.rob.01[,2] # lower bound
# results
cbind(ci.95.lo.rob.01, ci.95.up.rob.01)
## ci.95.lo.rob.01 ci.95.up.rob.01
## (Intercept) 678.618800 719.247098
## STR -3.298007 -1.261609
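The interval bounds above use the rounded normal critical value 1.96. Alternatively, lmtest::coefci() computes confidence intervals directly (it uses t critical values, so the numbers will differ marginally from the hand-computed ones). A sketch, assuming lm.res.01 from above:

```r
library(sandwich)
library(lmtest)
# assumes lm.res.01 from the estimation chunk above
coefci(lm.res.01, level = 0.95)  # homoskedasticity-only SEs
coefci(lm.res.01, level = 0.95,
       vcov. = vcovHC(lm.res.01, type = "HC1"))  # robust SEs
```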
Interpretation
General (for slope parameter)
\[\begin{align} H_0&: \quad \beta_{1} = \beta_{1,0} \quad \text{vs.} \\ H_1&: \quad \beta_{1} \neq \beta_{1,0} \quad \alpha = 5\% \end{align}\]
Here
\[\begin{align} H_0&: \quad \beta_{STR} = 0 \quad \text{vs.} \\ H_1&: \quad \beta_{STR} \neq 0 \quad \alpha = 5\% \end{align}\]
Add your results here!
\[\begin{align} t^{act} &= \frac{\widehat{\beta}_{STR}^{act} - 0}{SE\left(\widehat{\beta}_{STR}\right)} \\ t^{act} &= \frac{-2.2798 - 0}{0.5195} \\ t^{act} &\approx -4.389 \end{align}\]
Add your results here!
Add your results here!
\[\begin{align} p-\text{value} &= 2*\Phi\left(-|t^{act}|\right) \\ &= 2*\Phi\left(-4.389\right) \\ &= 1.45\times 10^{-5} \end{align}\]
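The t-statistic and p-value can also be computed step by step in R from the robust coefficient table; this serves as a check of the numbers above:

```r
# assumes ct.rob.01 (robust coefficient table) from above
t.act <- (ct.rob.01[2, 1] - 0) / ct.rob.01[2, 2]  # t statistic for H0: beta_1 = 0
p.val <- 2 * pnorm(-abs(t.act))                   # two-sided p-value
c(t.act = t.act, p.val = p.val)
```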
Add your results here!
Conclusion
Definition \(p\)-value
Add your results here!
ill_hyp_tes_fun(t.act = -4.39, test = "two.sid")
Add your interpretation here!
Illustration
CDF Standard Normal Distribution
Source: Stock and Watson (2020), p. 763-764
Related to the 'superintendent's question' from the lecture, construct the \(95\%\) confidence interval for the effect of reducing the student-teacher ratio by \(2\).
Confidence interval for \(\underline{\Delta STR = -2}\)
# 95% ci bounds for delta x = -2 (multiplying by -2 flips the bounds)
ci.95.d2.lo <- (ct.rob.01[2,1] + 1.96 * ct.rob.01[2,2]) * (-2) # lower bound
ci.95.d2.up <- (ct.rob.01[2,1] - 1.96 * ct.rob.01[2,2]) * (-2) # upper bound
# results
cbind(ci.95.d2.lo, ci.95.d2.up)
## ci.95.d2.lo ci.95.d2.up
## [1,] 2.523218 6.596014
Interpretation
Confidence interval for \(\underline{\beta_1}\)
\[\begin{align} \left[\widehat{\beta}_1 \pm 1.96 * \text{SE}\left(\widehat{\beta}_1\right) \right] \end{align}\]
Confidence interval for \(\underline{\beta_1 \Delta x}\)
Thus,
\[\begin{align} \left[\left(\widehat{\beta}_1 \pm 1.96 * \text{SE}\left(\widehat{\beta}_1\right) \right)*\Delta x\right] \end{align}\]
Here
Use robust standard errors
\[\begin{align} CI &= \left[\left(-2.28\pm 1.96\times 0.519\right) \times \left( -2 \right) \right] \\ &= \left[2.523,6.596\right] \end{align}\]
Note that multiplying by the negative \(\Delta x\) flips the interval bounds.
The \(95\%\) confidence interval for the effect of reducing the student-teacher ratio by \(2\) does not include zero.
Add your interpretation here!
Recoding \(STR\) into a binary variable for small and large class sizes and replicate the analysis from Chapter 2.8 of the lecture.
Binary regression model
Binary regressors
Regression model
\[\begin{align} Y_i=\beta_0 + \beta_1 X_i + u_i \end{align}\]
Coefficients
\[\begin{align} \mathrm{E}\left(Y_i|X_i=0\right)&=\beta_0 \\ \mathrm{E}\left(Y_i|X_i=1\right)&=\beta_0+\beta_1 \end{align}\]
\[\begin{align} \beta_1&=\mathrm{E}\left(Y_i|X_i=1\right)-\mathrm{E}\left(Y_i|X_i=0\right) \\ &= \text{population difference in group means} \end{align}\]
\(\Rightarrow\) Test if difference in means, i.e. \(\beta_1\), is different from zero!
Group means
# recoding of STR into binary variable
cas.dat$small <- round(cas.dat$STR, 2) < 20
# conditional mean
mea.sma <- mean(cas.dat[round(cas.dat$STR, 2) < 20, "score"])
mea.sma
## [1] 657.3513
mea.lar <- mean(cas.dat[round(cas.dat$STR, 2) >= 20, "score"])
mea.lar
## [1] 649.9788
# conditional sd
sd.sma <- sd(cas.dat[round(cas.dat$STR, 2) < 20, "score"])
sd.sma
## [1] 19.35801
sd.lar <- sd(cas.dat[round(cas.dat$STR, 2) >= 20, "score"])
sd.lar
## [1] 17.85336
sd.all <- sd(cas.dat[, "score"])
# number of observations
nn.sma <- length(cas.dat[round(cas.dat$STR, 2) < 20, "score"])
nn.sma
## [1] 238
nn.lar <- length(cas.dat[round(cas.dat$STR, 2) >= 20, "score"])
nn.lar
## [1] 182
nn.all <- length(cas.dat[, "score"])
nn.all
## [1] 420
Regression results
# recoding of STR into binary variable
cas.dat$small <- round(cas.dat$STR,2) < 20
# call wrapper function lm_ct_fun (here, using robust ses)
lm.ct.rob.01.dd <- lm_ct_fun(score ~ small, data = cas.dat, hc.type = "HC1")
ct.rob.01.dd <- lm.ct.rob.01.dd$ct
ct.rob.01.dd
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 649.9788 1.3229 491.3317 < 2.2e-16 ***
## smallTRUE 7.3724 1.8236 4.0428 6.288e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
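The regression on the dummy reproduces the group means computed above: the intercept is the mean score of the large classes, and the coefficient on smallTRUE is the difference in group means. A quick check, assuming cas.dat (with the small dummy), mea.sma and mea.lar from the previous chunks:

```r
# assumes cas.dat$small, mea.sma and mea.lar from above
coef(lm(score ~ small, data = cas.dat))
mea.lar            # should match the intercept
mea.sma - mea.lar  # should match the coefficient on smallTRUE
```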
Hypothesis
\[\begin{align} H_0&: \quad \beta_{1} = 0 \quad \text{vs.} \\ H_1&: \quad \beta_{1} \neq 0 \quad \alpha = 5\% \end{align}\]
Decision
Use \(t-\text{value}\) (based on robust SE’s):
Use \(p-\text{value}\) (based on robust SE’s):
\(\Rightarrow\) Statistically significant \(score\) difference between small and large classes.
Note, compare the results to the statistics review video.
Add your interpretation here!
Earnings and education
The data set CPS_01_data.txt contains information about earnings (in USD) and years of education of 2731 full-time workers.
Variables
| Variable | Description |
|---|---|
| \(ahe\) | Average hourly earnings. |
| \(yrseduc\) | Number of years of education. |
For more information see the example of Stock and Watson (2020) on page 193.
Load data, construct variables and estimation results.
Load data set
# use absolute or relative path to load data
cps.dat <- read.table(file = "CPS_01_data.txt",
header = TRUE,
sep = ",",
colClasses = c("numeric","numeric"))
Estimation results No 01
# estimate linear model
lm.res.02 <- lm(ahe ~ yrseduc, data = cps.dat)
# show results
lm.res.02
The estimation results below include robust standard errors, which are not part of the results from the lm() function above.
\(\Rightarrow\) See: implementation below!
##
## ============================
## Model 1
## ----------------------------
## intercept -12.12 (1.36) ***
## yrseduc 2.37 (0.10) ***
## ----------------------------
## R^2 0.18
## SER 11.24
## ============================
## *** p < 0.001; ** p < 0.01; * p < 0.05
Load the data set into your statistics software and provide a scatter plot of \(ahe\) (average hourly earnings) vs. \(yrseduc\) (years of education).
Scatter Plot
# plot parameters
par(mfrow=c(1, 1),
mar=c(5, 5, 3, 2))
# scatter plot
plot(cps.dat$yrseduc, cps.dat$ahe,
main = "Scatter Plot of Earnings and Years of Education",
xlab = "Years of education (X)",
ylab = "Average hourly earnings (Y)",
xlim = c(0, max(cps.dat$yrseduc)))
Add your interpretation here!
Run a regression of \(ahe\) on \(yrseduc\). State the underlying assumptions of a causal regression. Do you think these assumptions are valid in this case?
Estimate linear regression model
# Estimate linear regression model
lm.res.02 <- lm(ahe ~ yrseduc, data = cps.dat)
summary(lm.res.02)
##
## Call:
## lm(formula = ahe ~ yrseduc, data = cps.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.251 -6.757 -2.087 4.249 84.709
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.11631 1.35518 -8.941 <2e-16 ***
## yrseduc 2.37404 0.09553 24.851 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.24 on 2729 degrees of freedom
## Multiple R-squared: 0.1845, Adjusted R-squared: 0.1842
## F-statistic: 617.6 on 1 and 2729 DF, p-value: < 2.2e-16
Scatter Plot with Fitted Regression Line
# plot parameters
par(mfrow=c(1, 1),
mar=c(5, 5, 3, 2))
# scatter plot
plot(cps.dat$yrseduc, cps.dat$ahe,
main = "Scatter Plot of Earnings and Years of Education",
xlab = "Years of education (X)",
ylab = "Average hourly earnings (Y)",
xlim = c(0, max(cps.dat$yrseduc)),
ylim = c(-12.5, max(cps.dat$ahe)))
abline(lm.res.02, col = "red", lty = 2)
Least Squares Assumption for causal inference
Add your interpretation here!
Discuss the usage of heteroskedasticity-robust standard errors.
Scatter Plot of Residuals
# plot parameters
par(mfrow=c(1, 1),
mar=c(5, 5, 3, 2))
# Residuals plot
plot(cps.dat$yrseduc, lm.res.02$residuals,
main = "Scatter Plot of Residuals (ahe) and Years of Education",
xlab = "Years of education (X)",
ylab = "Residuals (ahe) (Y)",
xlim = c(0, max(cps.dat$yrseduc)))
The residuals of lm.res.02 can be extracted using lm.res.02$residuals; the fitted values can be extracted using lm.res.02$fitted.values.
Homoskedasticity vs. heteroskedasticity
Add your interpretation here!
Use the appropriate standard errors to construct a \(99\%\) confidence interval for the slope coefficient.
Inference based on ordinary SE’s
# call wrapper function lm_ct_fun (here, using ordinary ses)
lm.ct.ord.02 <- lm_ct_fun(ahe ~ yrseduc, data = cps.dat, hc.type = "const")
ct.ord.02 <- lm.ct.ord.02$ct
ct.ord.02
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.11631 1.35518 -8.9407 < 2.2e-16 ***
## yrseduc 2.37404 0.09553 24.8514 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Inference based on robust SE’s (w.r.t. heteroskedasticity)
# call wrapper function lm_ct_fun (here, using robust ses)
lm.ct.rob.02 <- lm_ct_fun(ahe ~ yrseduc, data = cps.dat, hc.type = "HC1")
ct.rob.02 <- lm.ct.rob.02$ct
ct.rob.02
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.11631 1.35557 -8.9381 < 2.2e-16 ***
## yrseduc 2.37404 0.10244 23.1744 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Confidence interval based on ordinary SE’s
# 99% ci bounds based on homoskedasticity-only ses
ci.99.up.ord.02 <- ct.ord.02[,1] + 2.58*ct.ord.02[,2] # upper bound
ci.99.lo.ord.02 <- ct.ord.02[,1] - 2.58*ct.ord.02[,2] # lower bound
# results
cbind(ci.99.lo.ord.02,ci.99.up.ord.02)
## ci.99.lo.ord.02 ci.99.up.ord.02
## (Intercept) -15.612672 -8.619950
## yrseduc 2.127578 2.620511
Confidence interval based on robust SE’s (w.r.t. heteroskedasticity)
# 99% ci bounds based on heteroskedasticity-robust ses
ci.99.up.rob.02 <- ct.rob.02[,1] + 2.58*ct.rob.02[,2] # upper bound
ci.99.lo.rob.02 <- ct.rob.02[,1] - 2.58*ct.rob.02[,2] # lower bound
# results
cbind(ci.99.lo.rob.02,ci.99.up.rob.02)
## ci.99.lo.rob.02 ci.99.up.rob.02
## (Intercept) -15.613689 -8.618933
## yrseduc 2.109743 2.638346
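The value 2.58 used above is the rounded two-sided normal critical value for a \(99\%\) interval; the exact value can be obtained with qnorm():

```r
# exact two-sided 1% critical value of the standard normal distribution
qnorm(0.995)
# approximately 2.576
```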
Use robust standard errors
\[\begin{align} CI &= \left[\widehat{\beta}_1\pm 2.58\times SE\left(\widehat{\beta}_1\right)\right] \\ &= \left[2.374\pm 2.58\times 0.102\right] \\ &= \left[2.11,2.638\right] \end{align}\]
The \(99\%\) confidence interval for the slope coefficient does not include zero.
Add your interpretation here!
Test if the coefficient of \(yrseduc\) is significantly different than 2 (\(\alpha=0.05\)).
Interpretation
General (for slope parameter)
\[\begin{align} H_0&: \quad \beta_{1} = \beta_{1,0} \quad \text{vs.} \\ H_1&: \quad \beta_{1} \neq \beta_{1,0} \quad \alpha = 5\% \end{align}\]
Here
\[\begin{align} H_0&: \quad \beta_{yrseduc} = 2 \quad \text{vs.} \\ H_1&: \quad \beta_{yrseduc} \neq 2 \quad \alpha = 5\% \end{align}\]
Add your results here!
\[\begin{align} t^{act} &= \frac{\widehat{\beta}_{yrseduc}^{act} - 2}{SE\left(\widehat{\beta}_{yrseduc}\right)} \\ t^{act} &= \frac{2.374 - 2}{0.102} \\ t^{act} &= 3.667 \end{align}\]
Add your results here!
Add your results here!
\[\begin{align} p-\text{value} &= 2*\Phi\left(-|t^{act}|\right) \\ &= 2*\Phi\left(-3.667\right) \\ &= 2.45\times 10^{-4} \end{align}\]
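The test of \(H_0: \beta_{yrseduc} = 2\) can also be run directly with car::linearHypothesis() (the car package is in the required package list above), supplying an HC1 covariance matrix. Note that it reports an F statistic, which for a single restriction equals the squared t statistic:

```r
library(car)       # linearHypothesis()
library(sandwich)  # vcovHC()
# assumes lm.res.02 from the estimation chunk above
linearHypothesis(lm.res.02, "yrseduc = 2",
                 vcov. = vcovHC(lm.res.02, type = "HC1"))
```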
Add your results here!
Conclusion
Definition \(p\)-value
Add your results here!
ill_hyp_tes_fun(t.act = 3.67, test = "two.sid")
Illustration
CDF Standard Normal Distribution
Source: Stock and Watson (2020), p. 763-764